[ADAM-883] Add caching to Transform pipeline. #884
Test PASSed.
@@ -105,6 +106,10 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var mdTagsFragmentSize: Long = 1000000L
@Args4jOption(required = false, name = "-md_tag_overwrite", usage = "When adding MD tags to reads, overwrite existing incorrect tags.")
var mdTagsOverwrite: Boolean = false
@Args4jOption(required = false, name = "-cache", usage = "Cache data to avoid recomputing between stages.")
var cache: Boolean = false
@Args4jOption(required = false, name = "-storageLevel", usage = "Set the storage level to use for caching.")
-storageLevel → -storage_level?
Good catch; will fix. Thanks!
The Transform pipeline in the CLI has several stages (e.g., sort, indel realignment, BQSR) that trigger recomputation. If you are running a single stage off of local storage/HDFS/Tachyon, this is OK. However, if you're running multiple stages, or you are loading data from S3/etc, this can lead to serious performance degradation. To address this, I've added the proper caching statements. Additionally, I've added a hook so that the user can specify the storage level to use for caching. Resolves bigdatagenomics#883.
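For readers unfamiliar with the pattern, here is a minimal sketch of conditional caching with a user-selected storage level. It is not the code from this PR: the `CacheSketch` object, the `maybePersist` helper, and its parameter names are hypothetical, and the storage level is assumed to arrive as a string such as "MEMORY_ONLY". Only standard Spark calls are used (`RDD.persist`, `RDD.unpersist`, `RDD.getStorageLevel`, `StorageLevel.fromString`).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Hypothetical helper (not from the PR): persist the RDD only when caching
  // was requested. StorageLevel.fromString throws IllegalArgumentException
  // if the name is not a valid storage level.
  def maybePersist[T](rdd: RDD[T], cache: Boolean, storageLevel: String): RDD[T] = {
    if (cache) rdd.persist(StorageLevel.fromString(storageLevel)) else rdd
  }

  def main(args: Array[String]): Unit = {
    // Local context for illustration only.
    val conf = new SparkConf().setMaster("local[1]").setAppName("cache-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for an expensive upstream stage.
      val stage = sc.parallelize(1 to 1000).map(x => x * 2)
      val cached = maybePersist(stage, cache = true, storageLevel = "MEMORY_ONLY")

      // Two downstream actions: without persist, `stage` is recomputed for each.
      val total = cached.map(_.toLong).reduce(_ + _)
      val count = cached.count()
      println(s"total=$total count=$count level=${cached.getStorageLevel}")

      // Release the cached blocks once downstream stages are done.
      cached.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Taking the level as a string keeps the command-line option simple, at the cost of failing at runtime on an invalid name; validating the string when arguments are parsed would surface that error earlier.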
Fixed nit and rebased.
Test PASSed.
[ADAM-883] Add caching to Transform pipeline.
Thanks!
The Transform pipeline in the CLI has several stages (e.g., sort, indel
realignment, BQSR) that trigger recomputation. If you are running a single
stage off of local storage/HDFS/Tachyon, this is OK. However, if you're running
multiple stages, or you are loading data from S3/etc, this can lead to serious
performance degradation. To address this, I've added the proper caching
statements. Additionally, I've added a hook so that the user can specify the
storage level to use for caching. Resolves #883.